1 Objectives

This practical is mostly reading and configuration, but you should complete the two parts labelled Exercise: and submit the knitted HTML and a link to your GitHub/GitLab repository via Brightspace. This practical will not be graded; it is your chance to make sure everything works properly and is submitted in the correct format for later practicals.

In terms of specific learning objectives:

  • Set up a reproducible R project with RStudio, an .Rproj file, and a Git repository.
  • Read, manipulate, and reshape tabular data using core tidyverse verbs (dplyr + tidyr).
  • Build layered visualisations with ggplot2.
  • Author a literate analysis with R Markdown that knits to HTML from a clean session.
  • Use Git to commit and push your work to GitHub/GitLab, with an appropriate .gitignore.
  • Produce a sessionInfo() footer so a reader knows exactly which package versions produced your results.

These skills underpin every subsequent practical in this course!

2 Why reproducibility matters in health data science

A “reproducible” analysis is one where another researcher (or future-you) can take your code and data and obtain the same results. In health research the stakes are higher than usual:

  • Findings inform clinical decisions and policy. Errors propagate.
  • Datasets are often access-controlled, so the code is the artefact others scrutinise most.
  • Regulators and journals increasingly require analysis code as a deliverable.

A practical reproducibility checklist for lab practicals:

  1. One project = one folder = one Git repository = one .Rproj file. No absolute paths like C:/Users/me/Desktop/....
  2. Code, data, and outputs are separated (e.g. R/, data/, figures/, output/).
  3. All package loads are explicit and at the top of the document.
  4. Random operations use set.seed().
  5. Knit from a clean R session (Session → Restart R) before submitting - this catches hidden state bugs.
  6. sessionInfo() is recorded at the bottom of the report.
  7. Raw data is read-only. Cleaning produces new files.
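
As a rough illustration of items 3, 4, and 6, the sketch below shows the order in which the code of a report might be laid out. The object names draws_first and draws_second are hypothetical and only serve to show that resetting the seed reproduces the same random numbers.

```r
# Item 3: all library() calls would go here, at the top, e.g. library(tidyverse)

# Item 4: fix the seed before any random operation
set.seed(2026)
draws_first <- rnorm(5)

# Resetting the same seed reproduces exactly the same draws
set.seed(2026)
draws_second <- rnorm(5)
identical(draws_first, draws_second)

# Item 6: record the R and package versions at the very end of the report
sessionInfo()
```

Without the set.seed() calls the two vectors would differ between runs, which is exactly the kind of hidden variation the checklist is meant to remove.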

3 Terminology

  • R - the open-source statistical programming language used throughout this course.
  • RStudio (developed by Posit) - R’s most popular IDE. It bundles editor, console, plots, environment, and Git into one window.
  • Package (library) - R’s equivalent of a Python module. Installed once with install.packages() and loaded per session with library().
  • Tidyverse - a coherent collection of packages (dplyr, tidyr, ggplot2, readr, tibble, purrr, stringr, forcats, lubridate) that share design principles and the pipe-friendly grammar.
  • Tibble - the tidyverse’s modernised data frame. Behaves like data.frame but prints more cleanly and never silently changes types.
  • R Markdown / Quarto - literate programming formats: prose + executable code chunks, knitted to HTML/PDF/Word.
  • Git - the dominant decentralised version-control system.
  • GitHub / GitLab - hosted Git platforms. Dalhousie has its own GitLab at gitlab.cs.dal.ca.

4 Setting up your system

4.1 Install RStudio and Git

  1. Install R (≥ 4.1, so the native pipe |> is available): https://cran.r-project.org/
  2. Install RStudio Desktop: https://posit.co/download/rstudio-desktop/
  3. Install Git: https://git-scm.com/downloads

Verify Git from a terminal and add your details to the configuration:

git --version
git config --global user.name  "YOURNAME"
git config --global user.email "YOUREMAIL"

4.2 The RStudio panes

By default RStudio has four panes (configurable in Tools → Global Options → Pane Layout):

  • Editor (top-left) - where you write .R, .Rmd, and other source files.
  • Console / Terminal (bottom-left) - interactive R, plus a system shell tab.
  • Environment / History / Git (top-right) - current variables, command history, and Git status.
  • Files / Plots / Packages / Help / Viewer (bottom-right) - file browser, plot output, help pages.

Quick sanity check - type the following in the Console and press Enter:

x <- 2
x
## [1] 2

You should see x appear in the Environment pane.

4.3 Create an RStudio project linked to Git

The recommended workflow is GitHub/GitLab-first:

  1. Create an empty repository on GitHub/GitLab with a sensible name, e.g. arhds-labs (you can add a README and an R .gitignore template if prompted).
  2. In RStudio: File → New Project → Version Control → Git, paste the repo URL, choose a parent directory.
  3. RStudio creates a folder containing an .Rproj file. Always open that .Rproj to start work - it sets the working directory automatically, which is the foundation of path reproducibility.

A sensible starter .gitignore for R projects:

.Rhistory
.RData
.Ruserdata
.Rproj.user/
*.html
*.pdf
# if data is large or non-redistributable
/data/raw/
# if using renv
/renv/library/

When submitting labs you must include a link to your GitHub/GitLab repository:

  • Dalhousie GitLab: if you set the visibility level to internal when you create the repository, it will be visible to anyone logged into git.cs.dal.ca and no further configuration is needed. To limit access, set it to private, create it, and then use the left-side menu Manage → Members → Invite Members to invite my CSID, finlaym, to the repository.

  • GitHub: if you choose public visibility when you create the repository, anyone online can see it and no further configuration is needed. To limit access, set it to private, create it, then click Invite collaborators and add fmaguire to your repository.

4.4 Aside: reproducible package management with renv

For coursework, plain library() calls are fine. For your bigger research projects, you should consider using a utility like renv (renv::init()) to pin exact package versions in renv.lock so collaborators get the same environment.

5 R fundamentals (a quick refresher)

If you have never used R, have a look at the Harvard Chan Intro-R module material. This section is a compressed summary of the key details.

5.1 Vectors

A vector is an ordered collection of values of the same type. R indexes from 1 (not 0) and supports negative and logical indexing. Note: a negative index drops that element rather than counting from the end, as it would in Python.

my_vector <- c(1, 4, 3, 2)
my_vector[2]              # second element
## [1] 4
my_vector[-2]             # all elements EXCEPT the second
## [1] 1 3 2
my_vector[2:3]            # second through third
## [1] 4 3
my_vector[my_vector > 2]  # logical indexing
## [1] 4 3
length(my_vector)
## [1] 4
mean(my_vector)
## [1] 2.5
sd(my_vector)
## [1] 1.290994
# Append
my_vector <- c(my_vector, 90)
my_vector <- c(30, my_vector)
my_vector
## [1] 30  1  4  3  2 90
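
The "same type" rule has a practical consequence that is easy to miss: c() silently coerces mixed inputs to the most general type. The short sketch below, using made-up values, shows this alongside the negative-indexing behaviour.

```r
# Mixing types in c() coerces everything to the most general type
mixed <- c(1, "a", TRUE)
class(mixed)          # "character"
mixed                 # "1" "a" "TRUE"

# Logical values become numbers when combined with numerics
c(1.5, TRUE, FALSE)   # 1.5 1.0 0.0

# Negative indices drop elements; they do not count from the end
v <- c(10, 20, 30)
v[-1]                 # 20 30
v[length(v)]          # the last element: 30
```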

5.2 Factors

A factor encodes a categorical variable. Levels are the allowed values; by default they are alphabetically ordered, which is rarely what you want.

expression <- c("low", "high", "medium", "high", "low", "medium", "high")

# Default: alphabetical ordering - usually wrong for ordinal data
factor(expression)
## [1] low    high   medium high   low    medium high  
## Levels: high low medium
# Specify a meaningful order
factor(expression, levels = c("low", "medium", "high"))
## [1] low    high   medium high   low    medium high  
## Levels: low medium high

In modern tidyverse code, prefer forcats (fct_relevel, fct_infreq, fct_lump) over base R for factor manipulation.
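
A minimal sketch of two of those forcats helpers, reusing the expression vector from above. The object names expr_fct, expr_ordered, and expr_by_freq are illustrative only.

```r
library(forcats)

expression <- c("low", "high", "medium", "high", "low", "medium", "high")
expr_fct <- factor(expression)   # alphabetical levels: high, low, medium

# Move levels into an explicit, meaningful order
expr_ordered <- fct_relevel(expr_fct, "low", "medium", "high")
levels(expr_ordered)

# Order levels by how often they occur (most frequent first)
expr_by_freq <- fct_infreq(expr_fct)
levels(expr_by_freq)[1]
```

fct_infreq() is mainly useful for plots, where bars ordered by frequency are usually easier to read than bars in alphabetical order.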

5.3 Data frames and tibbles

A data frame is a rectangular table whose columns may have different types. A tibble is the tidyverse drop-in replacement: same idea, better defaults.

# Base R
df_base <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  age        = c(58, 64, 71),
  sex        = c("F", "M", "F"),
  sbp_mmhg   = c(132, 145, 128)   # systolic blood pressure
)
df_base
# Tibble equivalent
library(tibble)
tb <- tibble(
  patient_id = c("P01", "P02", "P03"),
  age        = c(58, 64, 71),
  sex        = c("F", "M", "F"),
  sbp_mmhg   = c(132, 145, 128)
)
tb

Notice the tibble prints column types (<chr>, <dbl>) - useful when debugging type coercion bugs.
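
Two of the "better defaults" can be seen in a small comparison. This is only a sketch with a cut-down version of the table above: it shows that single-column subsetting keeps a tibble a tibble, and that tibbles refuse partial column-name matching.

```r
library(tibble)

df_base <- data.frame(patient_id = c("P01", "P02"), age = c(58, 64))
tb      <- tibble(patient_id = c("P01", "P02"), age = c(58, 64))

# Single-column subsetting: data.frame drops to a vector, tibble stays a tibble
class(df_base[, "age"])
class(tb[, "age"])

# Partial name matching with $: data.frame silently matches patient_id,
# whereas the tibble returns NULL and warns about the unknown column
df_base$pat
tb$pat
```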

5.4 The pipe

R has two pipes:

  • |> - the native pipe, built into R ≥ 4.1. No package required.
  • %>% - the magrittr pipe, loaded with dplyr/tidyverse. Older code uses this almost exclusively.

Both pass the left-hand side as the first argument of the right-hand side. For new code, prefer |>.

# These three are equivalent
exp(sqrt(16))                 # nested calls - read inside-out
## [1] 54.59815
sqrt(16) |> exp()             # native pipe (R 4.1+)
## [1] 54.59815
library(magrittr, quietly = TRUE)
sqrt(16) %>% exp()            # magrittr pipe
## [1] 54.59815

The pipe lets you read data transformations left-to-right, top-to-bottom, like a recipe.
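
The sketch below shows a slightly longer pipeline and one case the first-argument rule does not cover. It assumes R 4.2 or later for the `_` placeholder of the native pipe; the built-in mtcars data is used purely as a stand-in.

```r
# A multi-step pipeline reads top-to-bottom ...
piped <- c(1, 4, 9) |>
  sqrt() |>
  sum()

# ... and is equivalent to the nested form
nested <- sum(sqrt(c(1, 4, 9)))
identical(piped, nested)

# When the piped value is not the first argument, use a placeholder.
# The native pipe's `_` needs R >= 4.2 and must be passed to a named argument.
fit <- mtcars |> lm(mpg ~ wt, data = _)
coef(fit)
```

With the magrittr pipe the equivalent placeholder is a dot, as in mtcars %>% lm(mpg ~ wt, data = .).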

6 The tidyverse, in one chunk

library(tidyverse)   # loads dplyr, tidyr, ggplot2, readr, tibble, purrr, stringr, forcats, lubridate

This lab needs the following three packages. If any are missing, install them once from the Console like this:

install.packages(c("tidyverse", "datasauRus", "here"))

There is no need to run both library(dplyr) and library(tidyverse) - the latter loads the former. Stick with library(tidyverse) for analysis scripts; load individual packages only when writing a package or a constrained Shiny app.

7 Data manipulation with dplyr

dplyr provides a small set of verbs that compose into rich pipelines. We will use a tiny synthetic clinical dataset throughout.

set.seed(2026)   # reproducibility: any random draws below give identical results every run

clinic <- tibble(
  patient_id = sprintf("P%03d", 1:8),
  age        = c(58, 64, 71, 49, 82, 33, 67, 55),
  sex        = c("F", "M", "F", "M", "F", "M", "F", "M"),
  smoker     = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  sbp_mmhg   = c(132, 145, 128, 118, 162, 121, 150, 135),   # systolic BP
  bmi        = c(27.4, 31.2, 24.8, 22.0, 29.5, 21.3, 33.1, 26.6)
)
clinic

7.1 select() - pick columns

clinic |> select(patient_id, age, sbp_mmhg)
# Helpers
clinic |> select(starts_with("s"))
clinic |> select(where(is.numeric))

Python equivalent: df[["patient_id", "age", "sbp_mmhg"]] or df.select_dtypes(include="number").

7.2 filter() - pick rows

# Hypertensive smokers
clinic |> filter(sbp_mmhg >= 140, smoker)
# Logical OR
clinic |> filter(age >= 65 | bmi >= 30)

Python equivalent: df.query("sbp_mmhg >= 140 and smoker").
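
One behaviour worth knowing with clinical data is how filter() treats missing values. The sketch below uses a small hypothetical table, bp, with one missing blood pressure reading.

```r
library(dplyr)

bp <- tibble(
  patient_id = c("P01", "P02", "P03"),
  sbp_mmhg   = c(150, NA, 118)
)

# Rows where the condition evaluates to NA are dropped, just like FALSE rows
high_bp <- bp |> filter(sbp_mmhg >= 140)
high_bp

# Keep the missing values explicitly if they matter for the analysis
high_or_missing <- bp |> filter(sbp_mmhg >= 140 | is.na(sbp_mmhg))
high_or_missing

# Match against a set of values with %in%
bp |> filter(patient_id %in% c("P01", "P03"))
```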

7.3 mutate() - create or modify columns

clinic |>
  mutate(
    bp_category = case_when(
      sbp_mmhg <  120 ~ "normal",
      sbp_mmhg <  130 ~ "elevated",
      sbp_mmhg <  140 ~ "stage 1",
      TRUE            ~ "stage 2"
    ),
    obese = bmi >= 30
  )

case_when() is the multi-branch if/else of the tidyverse - much cleaner than nesting ifelse() calls.
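
To make the comparison concrete, here is a sketch that computes the same categories both ways on a plain vector of four readings. Note that case_when() checks conditions in order and the first match wins, so the cut-offs must be listed from lowest to highest.

```r
library(dplyr)

sbp <- c(132, 145, 128, 118)

# Nested ifelse(): correct but hard to read and easy to mis-bracket
nested <- ifelse(sbp < 120, "normal",
          ifelse(sbp < 130, "elevated",
          ifelse(sbp < 140, "stage 1", "stage 2")))

# case_when(): one condition per line, first match wins
flat <- case_when(
  sbp < 120 ~ "normal",
  sbp < 130 ~ "elevated",
  sbp < 140 ~ "stage 1",
  TRUE      ~ "stage 2"
)

identical(nested, flat)
flat
```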

7.4 arrange() - sort

clinic |> arrange(desc(sbp_mmhg))

7.5 summarise() and group_by() - collapse rows

clinic |>
  group_by(sex) |>
  summarise(
    n            = n(),
    mean_age     = mean(age),
    mean_sbp     = mean(sbp_mmhg),
    pct_smokers  = mean(smoker) * 100,
    .groups      = "drop"
  )

Python equivalent: df.groupby("sex").agg(...).

7.6 across() - apply a function to multiple columns

across() (introduced in dplyr 1.0) is the modern way to compute the same summary for many columns:

clinic |>
  group_by(sex) |>
  summarise(across(c(age, sbp_mmhg, bmi), \(x) mean(x, na.rm = TRUE)),
            .groups = "drop")

The \(x) ... syntax is R 4.1’s anonymous-function shorthand (equivalent to function(x) ... in R, or lambda x: ... in Python).
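
across() can also apply several functions at once. The sketch below uses a cut-down hypothetical table, vitals, and the .names argument to control the output column names.

```r
library(dplyr)

vitals <- tibble(
  sex      = c("F", "M", "F", "M"),
  age      = c(58, 64, 71, 49),
  sbp_mmhg = c(132, 145, 128, 118)
)

vitals_summary <- vitals |>
  group_by(sex) |>
  summarise(
    across(
      c(age, sbp_mmhg),
      list(mean = \(x) mean(x, na.rm = TRUE),
           sd   = \(x) sd(x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"   # e.g. mean_age, sd_age
    ),
    .groups = "drop"
  )
vitals_summary
```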

7.7 Other verbs worth knowing

  • rename(new = old) - rename columns.
  • relocate(col, .before = other) - reorder columns.
  • distinct() - drop duplicate rows.
  • slice_max(col, n = 5) / slice_min() / slice_sample(n = 100) - pick rows by rank or randomly.
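
A short sketch chaining several of these verbs on a hypothetical table, visits, that contains one duplicated row:

```r
library(dplyr)

visits <- tibble(
  id  = c("P01", "P02", "P02", "P03"),
  sbp = c(132, 145, 145, 128)
)

tidied <- visits |>
  distinct() |>                         # drop the duplicated P02 row
  rename(patient_id = id) |>            # new name = old name
  relocate(sbp, .before = patient_id)   # move sbp to the front

tidied

# The single row with the highest sbp
top_one <- tidied |> slice_max(sbp, n = 1)
top_one
```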

8 Reshaping data with tidyr

Tidy data has three properties:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each cell is a single value.

Most messy datasets violate at least one of these. Two verbs do most of the work:

8.1 pivot_longer() - wide → long

(pivot_longer() replaces the older gather(). You may still see gather() in older code; it works but is no longer recommended.)

life_expectancy <- tribble(
  ~country,    ~`2010`, ~`2015`, ~`2020`,
  "Australia",  82.0,    82.4,    83.0,
  "Canada",     80.7,    81.5,    81.9,
  "France",     81.8,    82.3,    83.0
)
life_expectancy
le_long <- life_expectancy |>
  pivot_longer(
    cols      = -country,        # everything except country
    names_to  = "year",
    values_to = "expectancy"
  ) |>
  mutate(year = as.integer(year))

le_long

8.2 pivot_wider() - long → wide

(pivot_wider() replaces spread().)

le_long |>
  pivot_wider(names_from = year, values_from = expectancy)
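
The two pivots are inverses of each other, which can be checked with a round trip. The sketch below rebuilds a smaller version of the table so that it runs on its own; the names wide, long, and round_trip are illustrative.

```r
library(tidyr)
library(tibble)

wide <- tribble(
  ~country, ~`2010`, ~`2020`,
  "Canada",  80.7,    81.9,
  "France",  81.8,    83.0
)

long <- wide |>
  pivot_longer(cols = -country, names_to = "year", values_to = "expectancy")

# Pivoting back recovers the original wide table
round_trip <- long |>
  pivot_wider(names_from = year, values_from = expectancy)

isTRUE(all.equal(as.data.frame(wide), as.data.frame(round_trip)))
```

The round trip works here because year is left as character in long; converting it to integer first, as in the earlier example, would still give columns named 2010 and 2020 after widening.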

8.3 Other useful tidyr verbs

# Split one column into many
tibble(x = c("a_1", "b_2")) |>
  separate(x, into = c("letter", "number"), sep = "_")
# Carry the last observation forward (e.g. visit dates)
tibble(visit = c(1, NA, NA, 4)) |>
  fill(visit)
# Drop rows with any missing
tibble(x = c(1, 2, NA), y = c(3, NA, 5)) |>
  drop_na()
# Replace specific NAs
tibble(x = c(1, 2, NA), y = c(3, NA, 5)) |>
  replace_na(list(x = 0, y = 99))
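
From tidyr 1.3.0 onward, separate() is marked as superseded in favour of a family of separate_wider_*() functions. It still works, but for new code the following sketch, which assumes tidyr 1.3.0 or later is installed, shows the newer equivalent of the first example above.

```r
library(tidyr)
library(tibble)

codes <- tibble(x = c("a_1", "b_2"))

# Newer equivalent of separate(x, into = c("letter", "number"), sep = "_")
split_codes <- codes |>
  separate_wider_delim(x, delim = "_", names = c("letter", "number"))

split_codes
```

Both new columns are character, so a numeric column such as number would still need an explicit conversion, for example with mutate(number = as.integer(number)).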

9 Reading and writing data with readr and here

Hard-coded paths break reproducibility. The here package resolves paths relative to the project root (the folder containing your .Rproj):

library(here)

# Write the clinic tibble to data/processed/
dir.create(here("data", "processed"), recursive = TRUE, showWarnings = FALSE)
write_csv(clinic, here("data", "processed", "clinic.csv"))

# Read it back - works on any machine, regardless of where the project lives
clinic2 <- read_csv(here("data", "processed", "clinic.csv"))

readr functions (read_csv, read_tsv, read_delim) are typically faster than base R’s read.csv and never silently coerce strings to factors (read.csv did so by default before R 4.0).
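
readr guesses column types when none are given. For a report that must behave the same on every run, it can help to declare them. The sketch below uses a temporary file in place of here("data", "processed", "clinic.csv") so that it runs anywhere; the two-row table and its ID values are made up.

```r
library(readr)
library(tibble)

# A temporary file stands in for a path built with here()
csv_path <- tempfile(fileext = ".csv")
write_csv(tibble(patient_id = c("001", "002"), age = c(58, 64)), csv_path)

# Declare the expected type of each column instead of relying on guessing
clinic_typed <- read_csv(
  csv_path,
  col_types = cols(
    patient_id = col_character(),   # keeps the leading zeros
    age        = col_double()
  )
)
clinic_typed
```

Supplying col_types also silences the column-specification message that read_csv() otherwise prints, which keeps knitted output tidy.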

10 Visualisation with ggplot2

ggplot2 implements the Grammar of Graphics: every plot is a stack of layers built from data + aesthetic mappings + geometric objects + scales + coordinate system + theme.

data(mpg)   # fuel-economy dataset that ships with ggplot2

# A plot is built up with `+`, NOT the pipe.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Add aesthetic mappings - colour by class:

ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point(alpha = 0.7) +
  labs(
    x      = "Engine displacement (L)",
    y      = "Highway MPG",
    colour = "Vehicle class",
    title  = "Larger engines deliver lower fuel economy"
  ) +
  theme_minimal()

Common geoms:

# Bar chart of counts
ggplot(mpg, aes(x = class)) + geom_bar() + theme_minimal()

# Histogram
ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20) + theme_minimal()

# Density
ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4) +
  theme_minimal()

# Faceting - small multiples
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class) +
  theme_minimal()
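
Because a plot is a stack of layers, it can be stored in a variable and extended later. The sketch below illustrates this with the same mpg data; the names p_points and p_trend are hypothetical, and layer_data() is used only to peek at what each layer will draw.

```r
library(ggplot2)

# Store the base plot; nothing is drawn until the object is printed
p_points <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5)

# Add a second layer to the stored plot with `+`
p_trend <- p_points +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# layer_data() returns the data that a given layer will draw
nrow(layer_data(p_points, 1))   # one row per car in mpg
head(layer_data(p_trend, 2))    # points along the fitted line

p_trend
```

Storing the base plot this way avoids repeating the data and aesthetic mappings when you want several variants of the same figure.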

11 Exercise: Datasaurus

This exercise demonstrates why we visualise data: thirteen datasets with nearly identical summary statistics but wildly different shapes.

library(datasauRus)

datasaurus_dozen |>
  count(dataset)

The original Datasaurus is from Alberto Cairo’s blog post; the rest are from Matejka & Fitzmaurice’s Same Stats, Different Graphs (CHI 2017).

Q1. How many rows and columns does datasaurus_dozen contain, and what are the variables?

Q2. Uncomment and complete the ggplot code to plot y vs x for the dino subset, and compute the Pearson correlation.

dino_data <- datasaurus_dozen |>
  filter(dataset == "dino")

#ggplot(dino_data, aes( # complete...

dino_data |> summarise(r = cor(x, y))

Q3. Repeat for the star dataset. Compare its r to that of dino.

Q4. Repeat for the circle dataset. Compare its r to that of dino.

Q5. Complete the following code to visualise all the datasets at once with faceting and calculate the statistics as a single grouped summarise command.

ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point(size = 0.7) +
  facet_wrap(~ dataset, ncol = 3) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text       = element_blank(),
        axis.ticks      = element_blank())

datasaurus_dozen |>
  group_by(dataset) |>
  summarise(
    mean_x = mean(x),
    #... complete this to calculate the mean for y, standard deviation for x and y, and pearson correlation
    .groups = "drop"
  )

Q6. Write 2–3 sentences in your knitted document on why these summary statistics are nearly identical despite the obvious visual differences, and what this implies for exploratory data analysis on real clinical datasets.

12 Exercise: Air Quality Data

R ships with airquality (daily air-quality measurements, New York, 1973). Despite its age it is a useful, mildly messy dataset for practising the verbs above.

Q7. Using airquality:

  1. Drop rows where Ozone is missing.
  2. Add a column month_name with the abbreviated month name ("May", "Jun", …). Hint: month.abb is a built-in vector.
  3. Compute mean Ozone, mean Temp, and the count of complete days per month.
  4. Plot Ozone against Temp, coloured by month, with a smoothed trend line (geom_smooth(method = "lm")).

aq <- as_tibble(airquality)

#...

13 Git workflow for this practical

A minimal commit cycle from the RStudio Terminal pane (or a system shell):

git status                               # what changed?
git add lab0_reproducible_research_tidyverse.Rmd
git commit -m "Lab 0: complete tidyverse + ggplot exercises"
git push

You can also use the Git tab in RStudio’s top-right pane: tick the files to stage, click Commit, write a message, click Push.

Best practice:

  • Commit early and often. Small commits are easier to review and revert.
  • Write present-tense, imperative messages (“Add Q5 facet plot”, not “Added the plot”).
  • Don’t commit your knitted HTML or PDFs - they are derived artefacts. The .gitignore above already excludes them.

15 Submission

For each practical you will submit:

  1. The knitted HTML of your R Markdown notebook to Brightspace. Knit from a clean session (Session → Restart R → Knit).
  2. A link to the source .Rmd in your Git repository (public, or shared with github:fmaguire / gitlab.cs.dal.ca:finlaym if private - see explanation above).

Submissions are due at midnight before the next week’s practical.

16 Optional further reading